An effective drug-disease associations prediction model based on graphic representation learning over multi-biomolecular network

Background Drug-disease associations (DDAs) provide important information for exploring the potential efficacy of drugs. However, few DDAs have so far been verified experimentally. Previous evidence indicates that combining multiple sources of information is conducive to the discovery of new DDAs. How to integrate different biological data sources and identify the most effective drugs for a given disease based on drug-disease coupled mechanisms remains a challenging problem. Results In this paper, we propose a novel computational model for DDA prediction based on graph representation learning over a multi-biomolecular network (GRLMN). Specifically, we first constructed a large-scale molecular association network (MAN) by integrating the associations among drugs, diseases, proteins, miRNAs, and lncRNAs. Then, a graph embedding model was used to learn vector representations for all drugs and diseases in the MAN. Finally, the combined features were fed to a random forest (RF) model to predict new DDAs. The proposed model was evaluated on the SCMFDD-S data set using fivefold cross-validation. Experimental results showed that GRLMN achieved an area under the ROC curve (AUC) of 87.9%, outperforming all previous works in terms of both accuracy and AUC on the benchmark data set. To further verify the performance of GRLMN, we carried out case studies on two common diseases. In the ranking of drugs predicted to be related to each disease (kidney disease and fever), 15 of the top 20 drugs have been experimentally confirmed. Conclusions The experimental results show that our model performs well in the prediction of DDAs. GRLMN is an effective prioritization tool for screening reliable DDAs for follow-up studies on drug repositioning.


Introduction
Drugs can relieve the symptoms of illness, control the further development of disease, and help the body to recover. Owing to increasingly frequent and abrupt disease outbreaks, the demand for new drugs is also on the rise. For example, the sudden outbreak of COVID-19 required researchers to develop drugs and vaccines in a short period of time. Drug repositioning can effectively reduce the cost of drug development by more than half. Although many researchers have proposed models for predicting drug-disease associations for drug repositioning, how to effectively extract drug-disease association information is still a challenging problem. Analyzing the complex associations between drugs and diseases from the microscopic perspective of biomolecules in cells can provide new insights into the mechanisms of disease.
Through the integration of large-scale genomic and protein data, a network model is constructed. This provides new ideas for predicting the association between disease molecules and drug molecules. The emergence of network-based predictive approaches not only comprehensively synthesizes associations among protein, miRNA, lncRNA, diseases, and drugs, but also provides a promising computational tool for determining new DDAs and repositioning drugs.
There have been many studies on drug repositioning prediction, including some network-based models. For example, Yu et al. proposed the Layer Attention Graph Convolutional Network (LAGCN) to predict DDAs, which uses graph convolution to learn from DDAs, drug-drug similarity, and disease-disease similarity, and uses an attention mechanism to combine multiple graph convolution layers [1]. SCMFDD is a DDA prediction method based on matrix factorization, which maps drug-disease associations into low-rank spaces and introduces disease semantic similarity and drug similarity as constraints [2]. Zhang et al. used a bipartite network to predict DDAs, using only drug and disease information [3]. Researchers are gradually solving the computational problem of drug repositioning from a macro perspective, but previous studies of DDA prediction have not considered the whole cell. The FSPGA algorithm proposed by He et al. can effectively detect more meaningful clusters hidden in an attributed graph, taking into account both the topology and the attribute values of the graph [4]. CCPMVFGC, proposed by He et al., captures the contextual interdependency of features in each cluster by combining graph clustering with multi-view learning [5]. The MrSBM model proposed by He et al. performs unsupervised learning tasks on network data. In addition to modeling edges located within blocks or connecting blocks, MrSBM also models vertex features using vertex-clustering preferences and the probability of feature-clustering contributions [6].
In previous studies on DDAs, some have considered adding an "intermediate bridge" molecule (such as miRNA or protein) between drugs and diseases [7]. This raises a question: does adding more types of biomolecules, and thereby increasing the complexity of the MAN, guarantee better DDA prediction? In fact, the relationships between biomolecules are complicated, and increasing the number of intermediate biomolecule types does not by itself ensure better DDA prediction. If many types of biomolecule data are introduced into the DDA prediction model, most of them will act as noise and directly affect the prediction results. Based on previous studies of miRNA-disease associations, lncRNA-disease associations, drug-protein associations, and disease-protein associations, we designed a DDA prediction model that uses protein, lncRNA, and miRNA as intermediate molecules. As shown in Fig. 1, there are nine confirmed types of associations among the five biomolecules [8].
Graphs are one of the most powerful frameworks in algorithms and can be used to represent almost all types of structures or systems. Different biomolecules and their interactions can be viewed as vertices (nodes) and links (edges) in a graph [9]. On this basis, we constructed a molecular association network (MAN) comprising miRNAs, lncRNAs, proteins, drugs, diseases, and nine types of associations (lncRNA-protein interaction [10], drug-protein association [11], protein-protein interaction [12,13], protein-disease interaction [14], miRNA-disease and miRNA-protein associations [15,16], miRNA-lncRNA association [17], lncRNA-disease interaction [18], and drug-disease association [19]). Each node in the MAN is described by the attributes of the node itself and by its associations with other nodes. Node attributes include drug molecular fingerprints, disease semantic information, ncRNA sequences, and protein sequences [20]. A distinctive feature of GRLMN is that it combines five types of biomolecules and nine types of molecular associations [21]. Although this paper mainly addresses drug repositioning, GRLMN is scalable and can also be used to predict associations between other molecules in the network [22]. Figure 2 shows the workflow of the GRLMN model, in which the complex network of biomolecules consists of two parts: nodes (drugs, diseases, proteins, miRNAs, and lncRNAs) and edges (the relationships between nodes) [23].
To evaluate the ability of GRLMN to predict DDAs, fivefold cross-validation was performed on the SCMFDD-S data set [24]. In comparisons with different feature models and classifier models, the proposed model achieved good results [25]. In addition, we tested the validity of the model on two human diseases, kidney disease and fever [26]. Among the top 20 drugs predicted by GRLMN to be related to kidney disease or fever, 15 have been verified in the Comparative Toxicogenomics Database (CTD) [27]. The experimental results show that the proposed model combines node attribute information and node association information to obtain effective and robust prediction performance [28]. Complex molecular association networks allow us to understand biology and disease pathology from a global perspective.

Multi-biomolecular associations data
In this work, the SCMFDD-S data set collected by Zhang et al. [29], which includes 269 drugs, 598 diseases, and 18,416 DDAs, is used for training. DrugBank [30] is a comprehensive database of drug information and provides SMILES strings for drugs. We use Python packages to convert SMILES strings to Morgan fingerprints. In addition, as shown in Table 1, we downloaded eight types of heterogeneous associations from nine other databases: 8374 pairs of miRNA-lncRNA associations from the lncRNASNP2 database, 16,427 pairs of miRNA-disease associations from the HMDD database [31], 4944 pairs of miRNA-protein associations from the miRTarBase database [32], and 1264 pairs of lncRNA-disease associations from the LncRNADisease [33] and lncRNASNP2 [34] databases. LncRNA2Target [35], DisGeNET [36], DrugBank, and STRING [37] provided 690 pairs of lncRNA-protein associations, 25,087 pairs of protein-disease associations, 11,107 pairs of drug-protein associations, and 19,237 pairs of protein-protein interactions, respectively [38][39][40]. After unifying identifiers, eliminating redundancy, simplifying, and deleting irrelevant items, the resulting experimental data are summarized in Table 2.
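The integration step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_network`, the `(type, identifier)` node convention, and the toy identifiers are all assumptions made for the example. Duplicate pairs are collapsed by the set-based adjacency, which mirrors the "eliminating redundancy" step, and all edges are unweighted.

```python
from collections import defaultdict


def build_network(association_tables):
    """Merge typed association tables into one undirected adjacency map.

    `association_tables` maps a (type_a, type_b) pair to a list of
    (id_a, id_b) pairs. Nodes are stored as (type, id) tuples so that
    identifiers from different databases cannot collide.
    """
    adjacency = defaultdict(set)
    for (type_a, type_b), pairs in association_tables.items():
        for id_a, id_b in pairs:
            u, v = (type_a, id_a), (type_b, id_b)
            if u == v:  # skip self-loops, e.g. in protein-protein tables
                continue
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


# Toy tables with hypothetical identifiers (not taken from the real databases).
tables = {
    ("drug", "disease"): [("D1", "X"), ("D1", "X"), ("D2", "X")],
    ("drug", "protein"): [("D1", "P1")],
}
network = build_network(tables)
n_edges = sum(len(nbrs) for nbrs in network.values()) // 2
```

Here the repeated `("D1", "X")` pair contributes a single edge, so the toy network ends up with three edges.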

Disease descriptors
In order to represent the similarity between diseases, we calculated disease semantic similarity by referring to the MeSH database [41], which was developed by the National Library of Medicine (NLM). The MeSH database categorizes diseases strictly and accurately. Each disease downloaded from https://www.nlm.nih.gov/ has a descriptor from which a directed acyclic graph (DAG) describing the disease can be constructed. Specifically, the DAG of disease $e$ can be described as $DAG_e = (e, N_e, D_e)$, where $N_e$ represents the set of diseases associated with disease $e$ (disease $e$ itself and its ancestor nodes), and $D_e$ represents the set of edges between them. The contribution of a disease $d$ in the set $N_e$ to the semantic value of disease $e$, written here as $C_e(d)$, is:

$$C_e(d) = \begin{cases} 1, & d = e \\ \max\{\varepsilon \cdot C_e(d') \mid d' \in \text{children of } d\}, & d \neq e \end{cases}$$

where $\varepsilon$ is a contribution parameter. The semantic value $DV(e)$ is obtained by adding up the contribution values of all diseases in the disease set $N_e$ [42]:

$$DV(e) = \sum_{d \in N_e} C_e(d)$$

We assume that the larger the part of their DAGs that two diseases share, the more similar they are. Based on this assumption, disease semantic similarity is calculated according to the relative positions of diseases $e(i)$ and $e(j)$ in the DAGs:

$$S(e(i), e(j)) = \frac{\sum_{d \in N_{e(i)} \cap N_{e(j)}} \left( C_{e(i)}(d) + C_{e(j)}(d) \right)}{DV(e(i)) + DV(e(j))}$$
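The DAG-based similarity described above can be illustrated with a short sketch. It assumes the commonly used formulation in which a disease contributes 1 to itself and each ancestor contributes the contribution parameter times the largest contribution among its children. The value `eps=0.5`, the function names, and the toy hierarchy are assumptions for the example; the paper does not state them.

```python
def semantic_contributions(disease, parents, eps=0.5):
    """Contribution of `disease` and each of its ancestors to `disease`.

    `parents` maps a term to the list of its parent terms in the DAG.
    Walking upward level by level reaches every ancestor first via its
    shortest path, which is also its largest contribution (eps ** depth).
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        next_frontier = []
        for child in frontier:
            for parent in parents.get(child, ()):
                value = eps * contrib[child]
                if value > contrib.get(parent, 0.0):
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib


def semantic_similarity(e_i, e_j, parents, eps=0.5):
    """Shared contributions divided by the sum of both semantic values."""
    c_i = semantic_contributions(e_i, parents, eps)
    c_j = semantic_contributions(e_j, parents, eps)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[d] + c_j[d] for d in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Toy hierarchy (hypothetical terms): A and B share the parent "root".
toy_parents = {"A": ["root"], "B": ["root"], "A1": ["A"]}
sim_siblings = semantic_similarity("A", "B", toy_parents)  # 1.0 / 3.0
sim_self = semantic_similarity("A", "A", toy_parents)      # 1.0
```

In the toy hierarchy, the two siblings share only the root term, so their similarity is (0.5 + 0.5) / (1.5 + 1.5) = 1/3, while a disease compared with itself scores 1.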

NcRNA and protein sequence descriptors
In order to standardize and characterize the ncRNA transcript and protein sequences, we use 3-mers to analyze each sequence. As shown in Fig. 2, to facilitate the coding of proteins and ncRNAs, we divided the 20 amino acids and the four nucleotides into four groups each [43]. The grouping of ncRNAs is adenine (A), cytosine (C), guanine (G), and uracil (U). As shown in Fig. 2c, we calculate the frequency of each combination of three amino acid groups or three nucleotides through a sliding window of length 3. Each sequence can therefore be expressed as a 64-dimensional ($4^3$) vector.
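The sliding-window encoding can be sketched as below. The function name `kmer_frequencies` is hypothetical, and normalizing counts by the number of windows is an assumption about what "frequency" means here. Only the nucleotide grouping given in the text is used; an amino acid grouping could be passed in the same way as a mapping from residues to four single-character group labels.

```python
from itertools import product


def kmer_frequencies(sequence, groups, k=3):
    """Return the normalized k-mer frequency vector of a grouped sequence.

    `groups` maps each residue to a single-character group label.
    Residues missing from `groups` are dropped before windowing.
    """
    labels = sorted(set(groups.values()))
    index = {"".join(p): i for i, p in enumerate(product(labels, repeat=k))}
    reduced = "".join(groups[ch] for ch in sequence if ch in groups)
    counts = [0] * len(index)
    n_windows = max(len(reduced) - k + 1, 0)
    for start in range(n_windows):
        counts[index[reduced[start:start + k]]] += 1
    if n_windows == 0:
        return [0.0] * len(counts)
    return [c / n_windows for c in counts]


# Each nucleotide forms its own group, as stated in the text.
RNA_GROUPS = {"A": "A", "C": "C", "G": "G", "U": "U"}
vector = kmer_frequencies("AUGGCUAUG", RNA_GROUPS)
```

For the toy sequence there are seven windows, and the 3-mer AUG occurs twice, so its entry in the 64-dimensional vector is 2/7.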

Stacked auto-encoder
As shown in Fig. 2b, the SMILES (simplified molecular input line entry system) string of each drug can be found in the DrugBank database. The RDKit Python package can convert SMILES strings into Morgan fingerprints [44,45]. In this work, a stacked auto-encoder (SAE) is introduced to extract features from the constructed Morgan fingerprints. As shown in Fig. 3a, an auto-encoder is a symmetric neural network trained without labels (unsupervised learning); it learns weights and biases such that its output approximates its input. Figure 3b shows the structure of a stacked auto-encoder with h auto-encoder stages. The vector output by the first auto-encoder layer is used as the input of the second auto-encoder layer, and so on, until the output vector of the top auto-encoder layer is obtained. Stochastic gradient descent was selected for training. Passing the drug molecular fingerprints through the stacked auto-encoder yields a vector characterizing molecular structure.
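The layer-by-layer stacking described above can be sketched in plain Python. This is an illustrative toy, not the authors' network: the sigmoid activation, squared reconstruction error, learning rate, layer sizes, and the 8-bit stand-in "fingerprints" are all assumptions, and a real Morgan fingerprint would be much longer.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(x, W1, b1):
    """Hidden code of one input vector."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]


def train_autoencoder(data, n_hidden, epochs=50, lr=0.5, seed=0):
    """Train one auto-encoder with per-sample SGD; return encoder weights."""
    rng = random.Random(seed)
    n_in = len(data[0])
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    b2 = [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            h = encode(x, W1, b1)
            y = [sigmoid(sum(w * hj for w, hj in zip(row, h)) + b)
                 for row, b in zip(W2, b2)]
            # Gradients of 0.5 * ||y - x||^2 through the sigmoid units.
            d_out = [(yi - xi) * yi * (1.0 - yi) for yi, xi in zip(y, x)]
            d_hid = [hj * (1.0 - hj) * sum(d_out[i] * W2[i][j] for i in range(n_in))
                     for j, hj in enumerate(h)]
            for i in range(n_in):
                for j in range(n_hidden):
                    W2[i][j] -= lr * d_out[i] * h[j]
                b2[i] -= lr * d_out[i]
            for j in range(n_hidden):
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid[j] * x[i]
                b1[j] -= lr * d_hid[j]
    return W1, b1


def stacked_encode(data, layer_sizes, **train_kwargs):
    """Greedy stacking: each layer is trained on the previous layer's codes."""
    codes = data
    for n_hidden in layer_sizes:
        W1, b1 = train_autoencoder(codes, n_hidden, **train_kwargs)
        codes = [encode(x, W1, b1) for x in codes]
    return codes


# Toy 8-bit vectors standing in for Morgan fingerprints (hypothetical data).
toy_fingerprints = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
drug_codes = stacked_encode(toy_fingerprints, [4, 2], epochs=20)
```

Each drug ends up with a two-dimensional code taken from the top layer; in practice a deep learning library would replace the hand-written gradient updates.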

Node representation
In the MAN, each node is described by two parts: the attributes of the node itself and its associations with other nodes. Attributes of the node itself include ncRNA sequences, protein sequences, semantic information of diseases, and drug fingerprints. Network representation learning is used to calculate the associations between each node and the other nodes, which globally represents the information flow across the entire network. Because the MAN is sparse and discrete, a simple and efficient low-dimensional representation is needed, and graph embedding is such a method. As a mainstream network embedding algorithm, LINE [46] can embed large-scale information networks into low-dimensional vector spaces and is suitable for any type of information network. LINE is based on the assumption of neighborhood similarity and can be seen as an algorithm that uses Breadth-First Search (BFS) to construct neighborhoods. A major feature of LINE is that its objective preserves both local and global network structure. LINE combines the first-order similarity and second-order similarity in the graph structure to obtain richer graph representations. Figure 2d illustrates first-order and second-order similarity. The thickness of an edge represents the value of its weight. Because node 6 and node 7 are directly connected and have a larger weight, their first-order similarity is higher. Node 5 and node 6 are not directly connected, but they share common adjacent nodes, so their embeddings should be close and their second-order similarity high. In the MAN, the weights of the edges are all equal.
First-order similarity models each undirected edge. For each undirected edge $(a, b)$, we first define the probability that the neighbor of vertex $v_a$ is $v_b$ as:

$$p_1(v_a, v_b) = \frac{1}{1 + \exp(-u_a^{T} u_b)}$$

where $u_a$ and $u_b$ are the embedding vector representations of node $a$ and node $b$, respectively. According to the weights of the edges, the empirical distribution can also be obtained:

$$\hat{p}_1(v_a, v_b) = \frac{w_{ab}}{W}$$

where $w_{ab}$ is the weight of edge $(a, b)$ and $W$ is the sum of the weights of the edges in the graph. In order to keep the empirical distribution similar to the probability distribution, we use KL divergence to measure the similarity of the two distributions. After we remove the constant term, the loss function obtained is as follows:

$$L_1 = -\sum_{(a,b) \in E} w_{ab} \log p_1(v_a, v_b)$$

where $E$ is the set of edges. Therefore, minimizing $L_1$ preserves the first-order similarity of the node embeddings in the graph.

Second-order similarity applies to both directed and undirected graphs. We first define the probability distribution of node transition:

$$p_2(v_b \mid v_a) = \frac{\exp(u_b'^{T} u_a)}{\sum_{k=1}^{|V|} \exp(u_k'^{T} u_a)}$$

where $|V|$ is the number of vertices, $u_a$ is the representation of $v_a$ when it is regarded as a vertex, and $u'_a$ is the representation of $v_a$ when it is treated as a specific "context". At the same time, the second-order empirical distribution is defined as follows:

$$\hat{p}_2(v_b \mid v_a) = \frac{w_{ab}}{d_a}, \qquad d_a = \sum_{k \in N(a)} w_{ak}$$

where $d_a$ is the out-degree of node $a$ and $N(a)$ is the set of adjacent nodes of node $a$.

To keep the empirical distribution and the probability distribution similar, we again use KL divergence to measure the similarity of the two distributions. After removing the constant term and performing a series of approximations, we get the loss function as follows:

$$L_2 = -\sum_{(a,b) \in E} w_{ab} \log p_2(v_b \mid v_a)$$

Fig. 3 The structure of the auto-encoder and structure of the stacked auto-encoder
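To make the two objectives concrete, the sketch below evaluates the first-order and second-order losses for given embeddings, assuming the standard LINE formulation (sigmoid of the inner product for first-order, softmax over context vectors for second-order). The function names and toy graph are assumptions. Only the loss values are computed; the optimization itself, which in LINE relies on approximations such as negative sampling and edge sampling, is not shown.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def first_order_loss(edges, emb):
    """-sum over edges of w_ab * log sigmoid(u_a . u_b)."""
    loss = 0.0
    for a, b, w in edges:
        p1 = 1.0 / (1.0 + math.exp(-dot(emb[a], emb[b])))
        loss -= w * math.log(p1)
    return loss


def second_order_loss(edges, emb, ctx):
    """-sum over directed edges of w_ab * log softmax_b(u'_b . u_a).

    `ctx` holds the context vector u' of every vertex. For an undirected
    graph, list each edge in both directions.
    """
    loss = 0.0
    for a, b, w in edges:
        scores = {k: dot(ctx[k], emb[a]) for k in ctx}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        loss -= w * (scores[b] - log_z)
    return loss


# Toy path graph 0 - 1 - 2 with unit weights and all-zero embeddings.
toy_edges = [(0, 1, 1.0), (1, 2, 1.0)]
zeros = {node: [0.0, 0.0] for node in range(3)}
l1 = first_order_loss(toy_edges, zeros)          # 2 * log(2)
l2 = second_order_loss(toy_edges, zeros, zeros)  # 2 * log(3)
```

With all-zero embeddings every edge has probability 1/2 under the first-order model and 1/3 under the three-node softmax, which gives the two reference values in the comments.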

Random forest
Ensemble learning has been widely used in bioinformatics; its idea is to combine multiple single classifiers into a new classifier to obtain better classification performance. We chose the random forest classifier, an ensemble learning algorithm, to classify and predict drug-disease associations [47]. Random forest mitigates the overfitting problem of single decision trees. Compared with single classifiers, it usually has more stable prediction performance [48]. Since stability and accuracy are very important for large-scale prediction of drug-disease associations, random forest was selected in this work as the classifier to process the extracted features.

Evaluation criteria
In order to verify the prediction ability of GRLMN, fivefold cross-validation was performed on the real data set described in Table 1. Specifically, fivefold cross-validation randomly divides the samples into 5 subsets of equal size. Each time, one subset is selected as the test set and the remaining subsets are used as the training set. The training process is repeated five times so that each subset is used as the test set once, and the average over the five runs is used as the final result. To quantify the results of fivefold cross-validation, we selected five evaluation criteria: sensitivity (SEN), specificity (SPE), precision (PRE), accuracy (ACC), and Matthews correlation coefficient (MCC). They are calculated as follows:

$$SEN = \frac{TP}{TP + FN}, \qquad SPE = \frac{TN}{TN + FP}, \qquad PRE = \frac{TP}{TP + FP}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is true positive, FP is false positive, TN is true negative, and FN is false negative. For further evaluation, we also compute the receiver operating characteristic (ROC) curve and summarize it numerically by the area under the ROC curve (AUC).
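The evaluation protocol can be sketched with the standard library alone. The function names are hypothetical, the confusion-matrix counts are invented for illustration and are not the paper's results, and the AUC is computed through the rank-based (Mann-Whitney) identity rather than by tracing the curve.

```python
import math
import random


def five_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def classification_metrics(tp, fp, tn, fn):
    """SEN, SPE, PRE, ACC, and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "PRE": tp / (tp + fp),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn) / denom,
    }


def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


splits = list(five_fold_splits(10))
metrics = classification_metrics(tp=80, fp=20, tn=80, fn=20)  # illustrative counts
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

With the illustrative counts, the first four criteria are all 0.8 and the MCC is 0.6; the toy AUC is 0.75 because three of the four positive-negative pairs are ranked correctly.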

Evaluate prediction performance
In this section, fivefold cross-validation was performed on the SCMFDD-S data set to evaluate the ability of the proposed model to predict DDAs. Table 3 shows that, on the SCMFDD-S data set, the average accuracy, sensitivity, specificity, and precision of GRLMN are all around 80%, and the Matthews correlation coefficient is 59.68%. All indicators perform well on a large network with nine types of biomolecular associations, which shows that GRLMN has good predictive ability by fusing molecular features.
As mentioned in the "Node representation" section, GRLMN calculates the association between each node and other nodes through the LINE algorithm to predict DDAs. In this section, we also evaluated the effectiveness of introducing node association information and node attribute information. We call the model that only uses the attributes of the node itself GRLMN_A, and the model that only uses the association information of the node GRLMN_M. As shown in Table 3 and Fig. 4, without the node's own attribute features, the prediction performance of GRLMN_M in fivefold cross-validation is significantly reduced, but all indicators are still higher than those of GRLMN_A. The comparison shows that the attributes of the node itself and the association information of the node are closely related and mutually beneficial to the prediction task.

Fig. 4 Performance of GRLMN in DDA prediction: a ROC curves yielded by GRLMN using fivefold cross-validation on SCMFDD-S data set. b ROC curves yielded by GRLMN_A using fivefold cross-validation on SCMFDD-S data set. c ROC curves yielded by GRLMN_M using fivefold cross-validation on SCMFDD-S data set

Impact of different graph embedding on GRLMN
Graph embedding has been widely used in recommender systems and computational advertising, and the corresponding algorithms are constantly being extended. In this section, we discuss the difference between applying LINE and Node2vec in the GRLMN model. Node2vec adjusts the transition weights of random walks so that the resulting graph embedding balances the homophily and the structural equivalence of the network. Specifically, "homophily" means that the embeddings of nodes that are close to each other in the network should be as close as possible, and "structural equivalence" means that the embeddings of nodes that are structurally similar should be as close as possible.
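The biased random walk that Node2vec uses to trade off these two notions can be sketched as follows. The return parameter `p` and in-out parameter `q` are the usual Node2vec hyperparameters; their values, the function name, and the toy graph are assumptions for illustration, since the paper does not report its Node2vec settings.

```python
import random


def node2vec_walk(adjacency, start, length, p=1.0, q=1.0, seed=0):
    """Second-order biased walk on an unweighted graph.

    From the step prev -> cur, a neighbor x of cur gets weight 1/p if
    x == prev (return), 1 if x is also adjacent to prev (stay close),
    and 1/q otherwise (move outward).
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adjacency[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for x in neighbors:
            if x == prev:
                weights.append(1.0 / p)
            elif x in adjacency[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Small undirected toy graph given as neighbor sets.
toy_graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
walk = node2vec_walk(toy_graph, start=0, length=10, p=0.5, q=2.0)
```

A large `q` keeps the walk near its starting neighborhood (breadth-first behavior), while a small `q` pushes it outward (depth-first behavior). The walks are then fed to a skip-gram model to obtain embeddings, which is not shown here.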
Following the control variable method, we replace the LINE component of GRLMN with Node2vec and leave the rest unchanged. To distinguish the two, we call GRLMN based on Node2vec GRLMN-node2vec and GRLMN based on LINE GRLMN-LINE. Figure 5a shows the fivefold cross-validation ROC curves of GRLMN-node2vec on the SCMFDD-S data set. Figure 5b shows the ROC curves yielded by GRLMN-node2vec containing only attribute features using fivefold cross-validation on the SCMFDD-S data set. The AUC of GRLMN-node2vec is 0.18% higher than that of the variant that contains only attribute features, but its performance is still inferior to that of GRLMN-LINE. LINE optimizes its objective function with an edge-sampling algorithm, which overcomes the limitations of traditional stochastic gradient descent and may explain its better performance.

Performance comparison
To further verify the performance of GRLMN in predicting DDAs, we performed fivefold cross-validation of six other models on the same data set. The SCMFDD model proposed by Zhang et al. [29] maps the associations between drugs and diseases into two low-rank spaces and uses matrix factorization to predict associations. Table 4 shows the average AUC of the six other models and of our method. On the SCMFDD-S data set, the AUC obtained by the proposed model was the highest: 0.78% higher than the AUC generated by LNS, 0.53% higher than SCMFDD-Drug interaction, 1.16% higher than SCMFDD-Enzyme, 0.81% higher than SCMFDD-Pathway, 3.77% higher than SCMFDD-Target, and 0.5% higher than SCMFDD-Substructure. These results show the advantage of GRLMN. Unlike the compared methods, GRLMN is more extensible because it uses the attributes of five types of biomolecules and their associations to form a molecular association network. Integrating more comprehensive molecular information yields better prediction results.

Impact of different classifier on GRLMN
GRLMN uses a random forest to make predictions based on feature fusion. In this section, we evaluate the effectiveness of the random forest. Specifically, we replace the random forest classifier with an AdaBoost classifier, a logistic regression classifier, and a naive Bayes classifier and compare the resulting models with GRLMN. Following the control variable method, all experimental settings are the same except for the classifier. To make the results more credible, fivefold cross-validation was performed on all four models. Grid search was used to find the best parameters of the random forest: n_estimators = 100, max_depth = 110. The AdaBoost, logistic regression, and naive Bayes classifiers all use default parameters. Table 5 and Fig. 6 show the results of combining the random forest classifier, the AdaBoost classifier, the logistic regression classifier, and the naive Bayes classifier with the proposed feature descriptors. The AdaBoost classifier achieved accuracy, sensitivity, specificity, precision, MCC, and AUC of 70.82%, 71.30%, 70.34%, 70.62%, 41.65%, and 78.05%, respectively, with standard deviations of 0.35%, 1.15%, 0.88%, 0.41%, 0.71%, and 0.52%. The logistic regression classifier achieved accuracy, sensitivity, specificity, precision, MCC, and AUC of 72.95%, 72.98%, 72.92%, 72.94%, 45.91%, and 80.41%, respectively, with standard deviations of 0.45%, 0.99%, 0.68%, 0.44%, 0.91%, and 0.54%. The naive Bayes classifier achieved accuracy, sensitivity, specificity, precision, MCC, and AUC of 68.27%, 70.86%, 65.69%, 67.37%, 36.60%, and 74.18%, respectively, with standard deviations of 0.55%, 0.86%, 0.76%, 0.53%, 1.10%, and 0.62%. The comparison shows that the classification results of the random forest classifier are superior to those of the other three classifiers. The average AUC of the random forest is 9.85%, 7.49%, and 13.72% higher than that of the AdaBoost classifier, the logistic regression classifier, and the naive Bayes classifier, respectively.

Case study
To further evaluate the ability of GRLMN to predict potential associations, we selected kidney disease and fever as case studies. Specifically, we use the SCMFDD-S data set to train the model. When predicting associations for a specified disease, all associations between that disease and drugs in the data set are deleted. We then validated the 20 drugs with the highest predicted scores against the independent CTD database. Kidney disease is usually caused by factors such as infection, genetics, and immunity. As shown in Table 6, we checked the top 20 drugs predicted for kidney disease in the CTD database and confirmed 15 of them. Fever is a state in which abnormal temperature regulation, or an imbalance between heat production and heat dissipation arising from various causes, raises body temperature beyond the normal range. The top-ranked drugs related to fever predicted by the GRLMN model are listed in Table 7. Comparing the prediction results with the CTD database, 15 of them were confirmed. Associations not listed in the CTD database may actually exist but have not yet been verified.
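The case-study protocol (hold out the target disease, rank drugs by predicted score, check the top of the list against an independent source) can be sketched as below. The function names, drug identifiers, scores, and the set of "confirmed" drugs are all hypothetical; in the paper the scores come from the trained GRLMN model and the confirmation from CTD.

```python
def hold_out_disease(associations, target_disease):
    """Drop every known (drug, disease) pair that involves the target disease."""
    return [(drug, dis) for drug, dis in associations if dis != target_disease]


def top_k_confirmed(scores, confirmed, k=20):
    """Rank drugs by score (ties broken by name) and count confirmed ones."""
    ranked = [drug for drug, _ in
              sorted(scores.items(), key=lambda item: (-item[1], item[0]))][:k]
    n_confirmed = sum(drug in confirmed for drug in ranked)
    return ranked, n_confirmed


# Hypothetical toy data standing in for SCMFDD-S pairs and model scores.
known_pairs = [("d1", "fever"), ("d2", "asthma"), ("d3", "fever")]
training_pairs = hold_out_disease(known_pairs, "fever")
toy_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.7}
ranked, n_confirmed = top_k_confirmed(toy_scores, {"d1", "d3", "d4"}, k=3)
```

In the toy run, the three highest-scoring drugs are d1, d2, and d4, of which two appear in the confirmation set.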

Conclusion
Drug repositioning requires substantial theoretical support from DDAs, so developing an algorithm for predicting DDAs is meaningful work. In this paper, the associations among drugs, diseases, miRNAs, lncRNAs, and proteins were integrated, and a multi-biomolecular network was constructed from the perspective of the cell.
In the experiments, we evaluated the GRLMN model on the SCMFDD-S data set using fivefold cross-validation. Experimental results show that the proposed model is highly accurate in predicting drug indications and outperforms the compared methods. In addition, the case studies of kidney disease and fever show that GRLMN performs well in predicting a list of potential drugs associated with a particular disease. Our prediction model can be applied to practical DDA prediction problems. The experimental results show that a large-scale association prediction network based on a machine learning model not only supplements laboratory experiments but also opens up a macroscopic perspective for predicting associations between molecules. Like general machine learning frameworks, the model has some unavoidable disadvantages. When new nodes are added, the network needs to learn the features again. The addition of new nodes should meet certain conditions: (1) the new node must be linked to the original network and cannot be an isolated node; (2) the more links there are between the new node and nodes in the network, the better the features that can be learned. However, the time cost of feature relearning is not very high, and modern hardware can handle it quickly.